Recipes for open vocabulary keyword spotting #1428

Merged

merged 18 commits into k2-fsa:master on Feb 22, 2024

Conversation

pkufool
Copy link
Collaborator

@pkufool pkufool commented Dec 25, 2023

This is an initial version of the decoder for an open-vocabulary keyword spotting system. The idea is almost the same as the context-biasing system we proposed before; I improved the ContextGraph so that users can easily trade off recall and precision.

I also trained some small zipformer models (around 3M parameters) on GigaSpeech (for English) and WenetSpeech (for Chinese) for keyword spotting; I will update the results and models in the following commits soon.
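To make the recall/precision knob concrete, here is a hedged sketch of the idea (illustrative only; `Node`, `KeywordGraph`, and their fields are hypothetical names, not the actual icefall `ContextGraph` API): partial keyword matches earn a per-token boosting score during beam search, and a keyword fires only when its accumulated acoustic score clears a threshold, so raising the boost favors recall while raising the threshold favors precision.

```python
# Illustrative sketch only; not the icefall ContextGraph API.
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    children: Dict[str, "Node"] = field(default_factory=dict)
    keyword: Optional[str] = None  # set on the last token of a keyword


class KeywordGraph:
    def __init__(self, keywords: Dict[str, List[str]],
                 boost: float = 1.5, threshold: float = 0.25):
        # boost: score added per matched token (higher -> better recall,
        #        more false alarms).
        # threshold: minimum average per-token acoustic probability the
        #        caller requires before firing (higher -> better precision,
        #        more misses).
        self.boost = boost
        self.threshold = threshold
        self.root = Node()
        for kw, tokens in keywords.items():
            node = self.root
            for t in tokens:
                node = node.children.setdefault(t, Node())
            node.keyword = kw

    def advance(self, node: Node, token: str):
        """Follow one token; return (next_state, boost_score, fired_keyword)."""
        if token in node.children:
            nxt = node.children[token]
            return nxt, self.boost, nxt.keyword
        return self.root, 0.0, None  # mismatch: fall back to the root
```

This toy version omits the failure transitions (Aho-Corasick style) that let a mismatch fall back to the longest matching suffix instead of all the way to the root.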

@wangtiance
Copy link
Contributor

Hello, what's the current progress of this PR? Thanks!

@pkufool
Copy link
Collaborator Author

pkufool commented Jan 15, 2024

> Hello, what's the current progress of this PR? Thanks!

Developing the runtime first; see k2-fsa/sherpa-onnx#505. I will clean up this PR soon.

@pkufool pkufool changed the title [WIP] decoder for open vocabulary keyword spotting [WIP] recipe for open vocabulary keyword spotting Jan 19, 2024
@alucassch
Copy link

Would it be possible to implement a KWS system using the output of the CTC branch, transforming it into a lattice so that Kaldi's decoders can be used, similar to what is done with kaldi-decoder/faster-decoder.h and kaldi-decoder/decodable-ctc.h? Which parts of the Kaldi code would have to be implemented in kaldi-decoder to achieve that? Could you give me some direction?

@pkufool
Copy link
Collaborator Author

pkufool commented Feb 1, 2024

@alucassch I do have a plan to use the CTC branch, but I don't think I will use the Kaldi decoders. As for using the Kaldi decoders: you can compile the keywords into a lattice, then decode the audio with this lattice (a faster-decoder is enough, I think). Then, for each frame (or chunk), you match the suffix of the decoded result against the keyword candidates; if it matches and the log-probability is larger than a given threshold, the corresponding keyword is triggered. Sorry, I don't have much experience in this direction, so this is just my thought; you can try it yourself.
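A toy rendering of that suffix-matching step, under the assumption that the decoder emits one (token, log-probability) pair per frame (`check_keywords` is a hypothetical helper, not Kaldi code):

```python
from typing import Dict, List, Optional, Tuple


def check_keywords(
    decoded: List[Tuple[str, float]],
    keywords: Dict[str, List[str]],
    logprob_threshold: float = -5.0,
) -> Optional[Tuple[str, float]]:
    """Return (keyword, score) if some keyword's token sequence matches a
    suffix of the decoded tokens and its total log-probability clears the
    threshold; otherwise return None."""
    tokens = [t for t, _ in decoded]
    for kw, kw_tokens in keywords.items():
        n = len(kw_tokens)
        if n <= len(tokens) and tokens[-n:] == kw_tokens:
            score = sum(lp for _, lp in decoded[-n:])
            if score >= logprob_threshold:
                return kw, score
    return None


# Called once per frame/chunk as new tokens arrive, e.g.:
# hit = check_keywords(decoded_so_far, {"lights on": ["lights", "on"]})
```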

@pkufool
Copy link
Collaborator Author

pkufool commented Feb 20, 2024

Here are some results from this PR; you can find more details in the RESULTS.md of each recipe.

English

The positive set is from https://github.com/pkufool/open-commands; the negative set is the test set of GigaSpeech.

Each metric has two columns: one for the original model trained on GigaSpeech, the other for the fine-tuned model.

small

| Commands | FN in positive set (original) | FN in positive set (finetune) | Recall (original) | Recall (finetune) | FP in negative set (original) | FP in negative set (finetune) | False alarms per hour, 40 h (original) | False alarms per hour, 40 h (finetune) |
|---|---|---|---|---|---|---|---|---|
| All | 43/307 | 4/307 | 86% | 98.7% | 1 | 24 | 0.025 | 0.6 |
| Lights on | 6/17 | 0/17 | 64.7% | 100% | 1 | 9 | 0.025 | 0.225 |
| Heat up | 5/14 | 1/14 | 64.3% | 92.9% | 0 | 1 | 0 | 0.025 |
| Volume down | 4/18 | 0/18 | 77.8% | 100% | 0 | 2 | 0 | 0.05 |
| Volume max | 4/17 | 0/17 | 76.5% | 100% | 0 | 0 | 0 | 0 |
| Volume mute | 4/16 | 0/16 | 75.0% | 100% | 0 | 0 | 0 | 0 |
| Too quiet | 3/17 | 0/17 | 82.4% | 100% | 0 | 4 | 0 | 0.1 |
| Lights off | 3/17 | 0/17 | 82.4% | 100% | 0 | 2 | 0 | 0.05 |
| Play music | 2/14 | 0/14 | 85.7% | 100% | 0 | 0 | 0 | 0 |
| Bring newspaper | 2/13 | 1/13 | 84.6% | 92.3% | 0 | 0 | 0 | 0 |
| Heat down | 2/16 | 2/16 | 87.5% | 87.5% | 0 | 1 | 0 | 0.025 |
| Volume up | 2/18 | 0/18 | 88.9% | 100% | 0 | 1 | 0 | 0.025 |
| Too loud | 1/13 | 0/13 | 92.3% | 100% | 0 | 0 | 0 | 0 |
| Resume music | 1/14 | 0/14 | 92.9% | 100% | 0 | 0 | 0 | 0 |
| Bring shoes | 1/15 | 0/15 | 93.3% | 100% | 0 | 0 | 0 | 0 |
| Switch language | 1/15 | 0/15 | 93.3% | 100% | 0 | 0 | 0 | 0 |
| Pause music | 1/15 | 0/15 | 93.3% | 100% | 0 | 0 | 0 | 0 |
| Bring socks | 1/12 | 0/12 | 91.7% | 100% | 0 | 0 | 0 | 0 |
| Stop music | 0/15 | 0/15 | 100% | 100% | 0 | 0 | 0 | 0 |
| Turn it up | 0/15 | 0/15 | 100% | 100% | 0 | 3 | 0 | 0.075 |
| Turn it down | 0/16 | 0/16 | 100% | 100% | 0 | 1 | 0 | 0.025 |

large

| Commands | FN in positive set (original) | FN in positive set (finetune) | Recall (original) | Recall (finetune) | FP in negative set (original) | FP in negative set (finetune) | False alarms per hour, 23 h (original) | False alarms per hour, 23 h (finetune) |
|---|---|---|---|---|---|---|---|---|
| All | 622/3994 | 79/3994 | 83.6% | 97.9% | 18/19930 | 52/19930 | 0.45 | 1.3 |

Chinese

The positive set is from https://github.com/pkufool/open-commands; the negative set is the test-net set of WenetSpeech.

Each metric has two columns: one for the original model trained on WenetSpeech, the other for the fine-tuned model.

small

| Commands | FN in positive set (original) | FN in positive set (finetune) | Recall (original) | Recall (finetune) | FP in negative set (original) | FP in negative set (finetune) | False alarms per hour, 23 h (original) | False alarms per hour, 23 h (finetune) |
|---|---|---|---|---|---|---|---|---|
| All | 426/985 | 40/985 | 56.8% | 95.9% | 7 | 1 | 0.3 | 0.04 |
| 下一个 | 5/50 | 0/50 | 90% | 100% | 3 | 0 | 0.13 | 0 |
| 开灯 | 19/49 | 2/49 | 61.2% | 95.9% | 0 | 0 | 0 | 0 |
| 第一个 | 11/50 | 3/50 | 78% | 94% | 3 | 0 | 0.13 | 0 |
| 声音调到最大 | 39/50 | 7/50 | 22% | 86% | 0 | 0 | 0 | 0 |
| 暂停音乐 | 36/49 | 1/49 | 26.5% | 98% | 0 | 0 | 0 | 0 |
| 暂停播放 | 33/49 | 2/49 | 32.7% | 95.9% | 0 | 0 | 0 | 0 |
| 打开卧室灯 | 33/49 | 1/49 | 32.7% | 98% | 0 | 0 | 0 | 0 |
| 关闭所有灯 | 27/50 | 0/50 | 46% | 100% | 0 | 0 | 0 | 0 |
| 关灯 | 25/48 | 2/48 | 47.9% | 95.8% | 1 | 1 | 0.04 | 0.04 |
| 关闭导航 | 25/48 | 1/48 | 47.9% | 97.9% | 0 | 0 | 0 | 0 |
| 打开蓝牙 | 24/47 | 0/47 | 48.9% | 100% | 0 | 0 | 0 | 0 |
| 下一首歌 | 21/50 | 1/50 | 58% | 98% | 0 | 0 | 0 | 0 |
| 换一首歌 | 19/50 | 5/50 | 62% | 90% | 0 | 0 | 0 | 0 |
| 继续播放 | 19/50 | 2/50 | 62% | 96% | 0 | 0 | 0 | 0 |
| 打开闹钟 | 18/49 | 2/49 | 63.3% | 95.9% | 0 | 0 | 0 | 0 |
| 打开音乐 | 17/49 | 0/49 | 65.3% | 100% | 0 | 0 | 0 | 0 |
| 打开导航 | 17/48 | 0/49 | 64.6% | 100% | 0 | 0 | 0 | 0 |
| 打开电视 | 15/50 | 0/49 | 70% | 100% | 0 | 0 | 0 | 0 |
| 大点声 | 12/50 | 5/50 | 76% | 90% | 0 | 0 | 0 | 0 |
| 小点声 | 11/50 | 6/50 | 78% | 88% | 0 | 0 | 0 | 0 |

large and others

| Test set | FN in positive set (original) | FN in positive set (finetune) | Recall (original) | Recall (finetune) | FP in negative set (original) | FP in negative set (finetune) | False alarms per hour, 23 h (original) | False alarms per hour, 23 h (finetune) |
|---|---|---|---|---|---|---|---|---|
| large | 2429/4505 | 477/4505 | 46.1% | 89.4% | 50 | 41 | 2.17 | 1.78 |
| 小云小云 (clean) | 30/100 | 40/100 | 70% | 60% | 0 | 0 | 0 | 0 |
| 小云小云 (noisy) | 118/350 | 154/350 | 66.3% | 56% | 0 | 0 | 0 | 0 |
| 你好问问 (finetune with all keywords data) | 2236/10641 | 678/10641 | 79% | 93.6% | 0 | 0 | 0 | 0 |
| 你好问问 (finetune with only 你好问问) | 2236/10641 | 249/10641 | 79% | 97.7% | 0 | 0 | 0 | 0 |

@pkufool pkufool changed the title [WIP] recipe for open vocabulary keyword spotting Recipes for open vocabulary keyword spotting Feb 20, 2024
@pkufool pkufool merged commit aac7df0 into k2-fsa:master Feb 22, 2024
73 of 93 checks passed
@lonngxiang
Copy link

lonngxiang commented Feb 25, 2024

Excuse me, is there a Python API for this? I looked at the docs at https://k2-fsa.github.io/sherpa/onnx/kws/pretrained_models/index.html#sherpa-onnx-kws-pre-trained-models, but they are not very clear.
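sherpa-onnx does expose a Python keyword-spotting API; a minimal sketch, assuming the `KeywordSpotter` class from the sherpa-onnx Python examples (the model and keyword file paths below are placeholders; check the linked docs for the exact arguments):

```python
# Minimal sketch following sherpa-onnx's Python keyword-spotting example;
# model/keyword file paths are placeholders.
import numpy as np
import sherpa_onnx

spotter = sherpa_onnx.KeywordSpotter(
    tokens="tokens.txt",
    encoder="encoder.onnx",
    decoder="decoder.onnx",
    joiner="joiner.onnx",
    keywords_file="keywords.txt",  # one tokenized keyword per line
    keywords_score=1.0,            # boosting score: recall vs. false alarms
    keywords_threshold=0.25,       # trigger threshold: precision vs. misses
    num_threads=1,
)

stream = spotter.create_stream()
samples = np.zeros(16000, dtype=np.float32)  # stand-in for real audio
stream.accept_waveform(16000, samples)
while spotter.is_ready(stream):
    spotter.decode_stream(stream)
print(spotter.get_result(stream))  # non-empty when a keyword fires
```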

@pkufool
Copy link
Collaborator Author

pkufool commented Mar 1, 2024

@wangtiance
Copy link
Contributor

```
  --decoder-dim 320 \
  --joiner-dim 320 \
  --num-encoder-layers 1,1,1,1,1,1 \
  --feedforward-dim 192,192,192,192,192,192 \
  --encoder-dim 128,128,128,128,128,128 \
  --encoder-unmasked-dim 128,128,128,128,128,128 \
```

Are these numbers the result of extensive search, or chosen with some intuition? Thanks!

@pkufool
Copy link
Collaborator Author

pkufool commented Mar 7, 2024

> ```
>   --decoder-dim 320 \
>   --joiner-dim 320 \
>   --num-encoder-layers 1,1,1,1,1,1 \
>   --feedforward-dim 192,192,192,192,192,192 \
>   --encoder-dim 128,128,128,128,128,128 \
>   --encoder-unmasked-dim 128,128,128,128,128,128 \
> ```
>
> Are these numbers the result of extensive search, or chosen with some intuition? Thanks!

No, just with some intuition. We are also searching for better and smaller models.

@manmanfu
Copy link

Hello, do you provide the pre-trained PyTorch (.pt) models from before the KWS fine-tuning? I only see the ONNX ones.

@pkufool
Copy link
Collaborator Author

pkufool commented Jul 25, 2024

> Hello, do you provide the pre-trained PyTorch (.pt) models from before the KWS fine-tuning? I only see the ONNX ones.

See RESULTS.md; the links are there.

@KIM7AZEN
Copy link
Contributor

There is a small bug in egs/wenetspeech/ASR/prepare.sh (lines 164, 172, 180): it uses a fixed num_splits of 1000. It should be `pieces=$(find data/fbank/M_split_${num_splits} -name "cuts_M.*.jsonl.gz")`.
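In shell terms, the suggested fix derives the piece list from whatever split files actually exist instead of hard-coding the count (a sketch; the surrounding loop is illustrative, not the verbatim prepare.sh):

```bash
# Sketch of the suggested fix: glob the actual split files rather than
# assuming a fixed number of pieces.
num_splits=1000  # whatever value the earlier split stage used
pieces=$(find data/fbank/M_split_${num_splits} -name "cuts_M.*.jsonl.gz")
for piece in $pieces; do
  echo "would process $piece"
done
```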

@pkufool
Copy link
Collaborator Author

pkufool commented Oct 16, 2024

@KIM7AZEN Thanks! Could you make a PR to fix it?

@KIM7AZEN
Copy link
Contributor

> @KIM7AZEN Thanks! Could you make a PR to fix it?

OK, wait a moment.

@zhuangweiji
Copy link
Contributor

zhuangweiji commented Nov 1, 2024

> Chinese
>
> small
>
> | Commands | FN in positive set (original) | FN in positive set (finetune) | Recall (original) | Recall (finetune) | FP in negative set (original) | FP in negative set (finetune) | False alarms per hour, 23 h (original) | False alarms per hour, 23 h (finetune) |
> |---|---|---|---|---|---|---|---|---|
> | All | 426/985 | 40/985 | 56.8% | 95.9% | 7 | 1 | 0.3 | 0.04 |
>
> large and others
>
> | Test set | FN in positive set (original) | FN in positive set (finetune) | Recall (original) | Recall (finetune) | FP in negative set (original) | FP in negative set (finetune) | False alarms per hour, 23 h (original) | False alarms per hour, 23 h (finetune) |
> |---|---|---|---|---|---|---|---|---|
> | large | 2429/4505 | 477/4505 | 46.1% | 89.4% | 50 | 41 | 2.17 | 1.78 |

What do 'small' and 'large and others' mean here? Are they referring to the size of the models or the sizes of different test sets? Why does the larger one seem to perform worse than the smaller one?

@pkufool
Copy link
Collaborator Author

pkufool commented Nov 26, 2024

@zhuangweiji They are test sets; see https://github.com/k2-fsa/icefall/blob/master/egs/wenetspeech/KWS/RESULTS.md for more details.
